Incremental Computation of Linear Machine Learning Models in Parallel Database Systems
Abstract
We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear machine learning models widely used in big data analytics. We prove that computing Gamma can be reduced to a single matrix multiplication with the data set, where the multiplication can be evaluated as a sum of vector outer products, enabling incremental and parallel computation, both essential for scalable computation. By exploiting Gamma, iterative algorithms are reorganized to work in two phases: (1) incremental, parallel data set summarization (i.e., in one scan, and distributive); (2) iteration in main memory, exploiting the summarization matrix in intermediate matrix computations (i.e., reducing the number of scans). Assuming the machine learning model is based on Gaussian distributions, we show that the covariance (and correlation) matrix, present in every Gaussian model, can be derived directly from Gamma. Therefore, many intermediate computations on large matrices collapse to computations based on Gamma, a much smaller matrix. We argue that it is necessary to develop specialized database algorithms for dense and sparse matrices, respectively, and we introduce a density threshold to choose between the two. Assuming a distributed-memory (i.e., shared-nothing) model and a much larger number of points than processing nodes, we show that computing Gamma exhibits close to linear speedup. We study how to compute Gamma with existing database system processing mechanisms and their impact on time complexity. We also highlight weaknesses and limitations of our proposal.
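The two-phase scheme described in the abstract can be illustrated with a minimal sketch, assuming the common formulation in which each point x_i is augmented to z_i = (1, x_i) so that Γ = Zᵀ Z = Σᵢ zᵢ zᵢᵀ. Because Γ is a sum over points, each shared-nothing worker can compute a partial Γ over its partition in one scan, and the partials simply add. The function names and the two-worker simulation below are illustrative, not the paper's actual implementation:

```python
import numpy as np

def partial_gamma(X):
    """Partial Gamma for one data partition, computed in one scan.

    Augmenting each row with a leading 1 makes Z.T @ Z accumulate
    n, L = sum x_i, and Q = sum x_i x_i.T in a single (d+1)x(d+1) matrix.
    """
    n = X.shape[0]
    Z = np.hstack([np.ones((n, 1)), X])  # z_i = (1, x_i)
    return Z.T @ Z                       # sum of outer products z_i z_i.T

def combine(partials):
    """Gamma is distributive: partial results from workers just add up."""
    return sum(partials)

def covariance_from_gamma(G):
    """Derive the covariance matrix directly from Gamma (Gaussian model)."""
    n = G[0, 0]        # count
    L = G[1:, 0]       # per-dimension sums
    Q = G[1:, 1:]      # sum of x_i x_i.T
    mu = L / n
    return Q / n - np.outer(mu, mu)

# Simulate two shared-nothing workers over disjoint partitions of the data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
G = combine([partial_gamma(X[:500]), partial_gamma(X[500:])])
cov = covariance_from_gamma(G)
```

After the single summarization pass, any iteration that only needs n, the mean, and the covariance can run entirely in main memory on the small (d+1)×(d+1) matrix G, which is the point of the two-phase design.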
Similar resources
Two-stage fuzzy-stochastic programming for parallel machine scheduling problem with machine deterioration and operator learning effect
This paper deals with the determination of machine numbers and production schedules in manufacturing environments. Along these lines, a two-stage fuzzy stochastic programming model is discussed with fuzzy processing times, where both deterioration and learning effects are evaluated simultaneously. The first stage focuses on the type and number of machines in order to minimize the total costs associat...
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
We study the serial and parallel computation of Γ (Gamma), a comprehensive data summarization matrix for linear Gaussian models, widely used in big data analytics. Computing Gamma can be reduced to a single matrix multiplication with the data set, where such multiplication can be evaluated as a sum of vector outer products, which enables incremental and parallel computation, essential features ...
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using the distributed and parallel computing technologies. Usually big data tools perform computation in batch-mode and are not optimized for iterative pr...
Fast and Efficient Algorithms for Video Compression and Rate Control
grated to the United States of America in 1975 with his parents, Dzuyet D. Hoang and Tien T. Tran, and two sisters. He now has three sisters and one brother. They have been living in Harvey, Louisiana. a Fulbright Scholar, and an IBM Faculty Development Awardee. He is coauthor of the book Design and Analysis of Coalesced Hashing and is coholder of patents in the areas of external sorting, predi...
Machine learning algorithms in air quality modeling
Modern studies in the field of environmental science and engineering show that deterministic models struggle to capture the relationship between the concentration of atmospheric pollutants and their emission sources. Recent advances in statistical modeling based on machine learning approaches have emerged as a solution to tackle these issues. It is a fact that input variable type largely affec...
Journal title:
Volume, Issue:
Pages: -
Publication year: 2016